Abstract

The degree distribution of many biological and technological networks 
has been described as a power-law distribution. While the degree distribu- 
tion does not capture all aspects of a network, it has often been suggested 
that its functional form contains important clues as to underlying evolu- 
tionary processes that have shaped the network. Generally, the functional 
form for the degree distribution has been determined in an ad-hoc fashion, 
with clear power-law like behaviour often only extending over a limited 
range of connectivities. Here we apply formal model selection techniques 
to decide which probability distribution best describes the degree distributions
of protein interaction networks. Contrary to previous studies,
this well-defined approach suggests that the degree distribution of many
molecular networks is often better described by distributions other than 
the popular power-law distribution. This, in turn, suggests that simple, if 
elegant, models may not necessarily help in the quantitative understand- 
ing of complex biological processes. 



1 Introduction 

Technological advances seen in molecular biology and genetics increasingly pro- 
vide us with vast amounts of data about genomic, proteomic and metabolomic 
network structures [1-3]. Understanding the way in which the different constituents
of such networks (proteins in the case of protein interaction networks,
PINs) interact is believed to yield important insights into basic biological
mechanisms [4, 5]. For example, the extent of phenotypic plasticity allowed
for by a network, or levels of similarity between molecular networks in different 
organisms, presumably depend at least to some extent on topological (in a loose 
sense of the word) properties of networks. 

The degree distribution, the number of nodes $n(k)$ that have $k$ connections
to other nodes, is one of the important characteristics of networks [6,7]. The
observed degree distribution can be used to define an empirical probability
distribution. If $N$ is the total number of nodes in the network then the probability
that a node has $k$ edges is defined via $\Pr(k) = n(k)/N$; both $\Pr(k)$ and $n(k)$
are often referred to as the degree distribution. It is widely understood that the
degree distribution is only one, and by no means the most important summary 
statistic of a network. Other frequent measures are the clustering coefficient, 
network diameters as well as motif frequencies and graph spectra, but so far 
degree distributions are generally the most widely studied characteristic. 
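
As a minimal sketch of the definition $\Pr(k) = n(k)/N$, the following Python snippet computes the empirical degree distribution from an undirected edge list. The function name `degree_distribution` and the toy edge list are hypothetical and chosen only for illustration; note that nodes without any edge do not appear in an edge list, so $N$ here counts only nodes with at least one interaction.

```python
from collections import Counter

def degree_distribution(edges):
    """Return {k: Pr(k)} for an undirected edge list, with Pr(k) = n(k) / N."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n_k = Counter(degree.values())   # n(k): number of nodes with k edges
    N = len(degree)                  # number of nodes that occur in the edge list
    return {k: n_k[k] / N for k in sorted(n_k)}

# Hypothetical toy network with four nodes (not PIN data).
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("c", "d")]
pr = degree_distribution(edges)
print(pr)
```

In this toy network node "a" has three edges, "c" and "d" have two each and "b" has one, so the distribution puts half of its mass on $k = 2$.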

It has frequently been suggested that natural and technological networks, including
PINs, show scale-free behaviour and that the degree distribution follows
a power-law $\Pr(k) = k^{-\gamma}/\zeta(\gamma)$, where $\zeta(\gamma)$ is Riemann's zeta function, which
is defined for $\gamma > 1$ and diverges as $\gamma \to 1^{+}$; for finite networks, however, it is
not necessary that the value of $\gamma$ is restricted to values greater than 1. Indeed
most, if not all, empirical degree distributions tail off slower than exponentially:
real biological, technological and social networks tend to have a few nodes with
many more connections than would be expected for classical, or Erdős-Rényi,
random graphs [8]. But so far treatments have focused on fitting power-laws (or
heuristically derived finite-size versions of power-laws) to the observed degree
distributions.

The notion of "scale-free" has a precise mathematical meaning. If a function
$f(k)$ is scale-free then the ratio $f(ak)/f(k)$ depends only on $a$ but not on
$k$. Most empirical degree distributions lack this property, at least globally. If
plotted on a log-log plot then many degree distributions do indeed take on the
shape of a straight line, at least over a range of connectivities, but never over
the whole range of connections.
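
The scale-free property can be checked numerically. The sketch below, with arbitrarily chosen illustrative parameters (`gamma=2.5`, `k_bar=5.0`), evaluates the ratio $f(ak)/f(k)$ for a power-law and for an exponential function: for the power-law the ratio equals $a^{-\gamma}$ for every $k$, whereas for the exponential it changes with $k$.

```python
import math

def power_law(k, gamma=2.5):
    return k ** (-gamma)

def exponential(k, k_bar=5.0):
    return math.exp(-k / k_bar)

def ratios(f, a, ks):
    """f(a*k) / f(k) for each k; constant in k if and only if f is scale-free."""
    return [f(a * k) / f(k) for k in ks]

ks = [1, 10, 100]
pl = ratios(power_law, 2, ks)     # the same value, 2**-2.5, for every k
ex = ratios(exponential, 2, ks)   # exp(-k / k_bar), which depends on k
print(pl)
print(ex)
```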

Here we will determine which probability model best describes the degree
distribution over the whole range of degrees for a number of different networks.
Our trial distributions are chosen from among the distributions which
are known to occur for theoretical network models (from graph theory or
statistical physics), supplemented by some well-known probability distributions
with fat tails. In addition to the Poisson distribution ($e^{-\lambda}\lambda^{k}/k!$ for all
$k > 0$; M1) and the power-law ($k^{-\gamma}/\zeta(\gamma)$; M4) we will also consider the
exponential ($C\exp(-k/\bar{k})$ for all $k > 0$ with normalizing constant $C$; M2),
the Gamma distribution ($k^{\gamma-1}e^{-k}/\Gamma(\gamma)$ for all $k > 0$; M3), the log-normal
($C\exp\left(-[\ln((k-\theta)/m)]^{2}/(2\sigma^{2})\right)/[(k-\theta)\sigma\sqrt{2\pi}]$ for all $k > 0$; M5) and the stretched
exponential ($C\exp(-\alpha k/\bar{k})\,k^{-\gamma}$ for $k > 0$; M6). As we will see, power-law
distributions do not always perform better than the other fat-tailed distributions.
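
To make the six trial models concrete, the following sketch writes each one as an unnormalised weight and normalises it numerically over the finite range $k = 1, \ldots, k_{\max}$. This is an illustration under stated assumptions rather than the procedure used for the results below: the numerical normalisation replaces the analytic constants, the log-normal offset $\theta$ is set to zero, the ratio $\alpha/\bar{k}$ of M6 is collapsed into a single `rate` parameter, and all parameter values are arbitrary.

```python
import math

# Unnormalised weights for the trial models M1-M6 (theta = 0 assumed for M5).
MODELS = {
    "M1": lambda k, lam: math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1)),
    "M2": lambda k, k_bar: math.exp(-k / k_bar),
    "M3": lambda k, gamma: k ** (gamma - 1) * math.exp(-k),
    "M4": lambda k, gamma: k ** (-gamma),
    "M5": lambda k, m, sigma: math.exp(-math.log(k / m) ** 2 / (2 * sigma ** 2)) / k,
    "M6": lambda k, rate, gamma: math.exp(-rate * k) * k ** (-gamma),
}

def pmf(weight, params, k_max):
    """Normalise weight(k, *params) numerically over k = 1..k_max."""
    w = [weight(k, *params) for k in range(1, k_max + 1)]
    total = sum(w)
    return [x / total for x in w]

def tail(p, k_min):
    """Pr(k >= k_min) for a pmf stored as a list starting at k = 1."""
    return sum(p[k_min - 1:])

k_max = 220
poisson = pmf(MODELS["M1"], (5.0,), k_max)
power = pmf(MODELS["M4"], (2.5,), k_max)
lognormal = pmf(MODELS["M5"], (3.0, 1.0), k_max)

# Fat-tailed models keep far more probability mass at high degrees.
print(tail(poisson, 50), tail(power, 50), tail(lognormal, 50))
```

With these illustrative parameters the Poisson model assigns a vanishingly small probability to degrees of 50 or more, while the power-law and log-normal retain appreciable mass there.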

2 Likelihood analysis of the degree distribution 
of a network 

We briefly introduce the basic statistical concepts employed later. These can be 
found in much greater detail in most modern statistics texts such as [?]. The
likelihood of a statistical model, $M$, given the observed data $D = \{D_1, D_2, \ldots, D_n\}$
is defined via

$$L(M) \propto \Pr(D \mid M) = \prod_{i=1}^{n} \Pr(D_i \mid M). \qquad (1)$$

Taking logarithms on both sides of Eqn. (1) yields the log-likelihood, and since 
the proportionality constant in Eqn. (1) may depend on the data but not on the 
underlying statistical model (i.e. the probability distribution), we may write 

$$\mathrm{lk}(M) = \sum_{i=1}^{n} \log \Pr(D_i \mid M). \qquad (2)$$

The model $M$ is parameterized by a set of $\nu$ parameters $\theta = \{\theta_j\}$ and
the maximum likelihood estimates (MLE), $\hat{\theta}$, of the parameters are the values
for which the expressions on the left-hand side of Eqns. (1) and (2) become
maximal. For these values the observed data are more likely to occur than for
any other parameters.
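
The following sketch illustrates maximum likelihood estimation for two of the models on a made-up degree sequence. It rests on several assumptions that are not taken from the analysis above: the zeta function is approximated by a sum truncated at `k_max=1000`, the power-law exponent is found by a simple golden-section search (valid because the power-law log-likelihood is concave in $\gamma$), and the degree sequence is hypothetical. For the Poisson model the MLE is available in closed form as the sample mean.

```python
import math

def loglik_poisson(lam, degrees):
    """lk(M1): sum of log Poisson probabilities, as in Eqn. (2)."""
    return sum(-lam + k * math.log(lam) - math.lgamma(k + 1) for k in degrees)

def loglik_power_law(gamma, degrees, k_max=1000):
    """lk(M4), with zeta(gamma) approximated by a sum truncated at k_max."""
    log_norm = math.log(sum(k ** -gamma for k in range(1, k_max + 1)))
    return sum(-gamma * math.log(k) - log_norm for k in degrees)

def maximise(f, lo, hi, tol=1e-6):
    """Golden-section search for the maximum of a unimodal function on [lo, hi]."""
    phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c, d = b - phi * (b - a), a + phi * (b - a)
    while b - a > tol:
        if f(c) > f(d):          # maximum lies in [a, d]
            b, d = d, c
            c = b - phi * (b - a)
        else:                    # maximum lies in [c, b]
            a, c = c, d
            d = a + phi * (b - a)
    return (a + b) / 2

# Hypothetical degree sequence of 100 nodes, for illustration only.
degrees = [1] * 60 + [2] * 20 + [3] * 9 + [4] * 5 + [6] * 3 + [10] * 2 + [25]

lam_hat = sum(degrees) / len(degrees)   # closed-form MLE of the Poisson mean
gamma_hat = maximise(lambda g: loglik_power_law(g, degrees), 1.01, 5.0)

print(lam_hat, loglik_poisson(lam_hat, degrees))
print(gamma_hat, loglik_power_law(gamma_hat, degrees))
```

For this skewed toy sequence the maximised log-likelihood of the power-law is higher than that of the Poisson model; how such differences are weighed against model complexity is the subject of the information criteria discussed next.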

Ultimately, however, we would like to be able to determine how much better, 
for example, a scale-free model is at explaining an observed degree distribution 
than a Poisson model. For non-nested models we have to employ an information 
criterion such as the Akaike information criterion (AIC) [9, 10] or the Bayesian 
information criterion (BIC). The AIC for a model Mj is defined via 

$$\mathrm{AIC}_j = 2\left(-\mathrm{lk}(M_j) + \nu_j\right) \qquad (3)$$

where $\nu_j$ is the number of parameters needed to define model $M_j$. The model
with the minimum AIC is chosen as the best model and the information criteria
balance a model's power against its complexity [9, 10]. In order to compare
different models we define the relative differences $\Delta_j^{\mathrm{AIC}} = \mathrm{AIC}_j - \min(\mathrm{AIC})$.
This in turn allows us to estimate the relative likelihoods of the models, $\mathcal{L}(M_j) \propto
\exp(-\Delta_j^{\mathrm{AIC}}/2)$. Normalizing these relative likelihoods yields the so-called Akaike
weights $w_j$,

$$w_j = \frac{\exp(-\Delta_j^{\mathrm{AIC}}/2)}{\sum_{i=1}^{J} \exp(-\Delta_i^{\mathrm{AIC}}/2)}. \qquad (4)$$



The Akaike weight $w_j$ can be interpreted as the probability that model $M_j$ (out
of the $J$ alternative models) is the best model given the observed data. The
relative support for one model over another is thus given by the ratio of their
respective Akaike weights. The Akaike weight formalism is very flexible and has
been applied in a range of contexts, including the assessment of confidence in
phylogenetic inference [11]. The equivalent quantities for the BIC, which arises
as a limit of formal Bayesian model selection, are called Schwarz weights and
can be interpreted in the same manner. In the next section we will apply this
formalism to PIN data from five species and estimate the level of support for
each of the models discussed above.
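
As a worked example of Eqns. (3) and (4), the sketch below computes AIC values and Akaike weights from the H. pylori log-likelihoods and parameter counts reported in Table 1. The function name `akaike_weights` is a hypothetical helper introduced here for illustration.

```python
import math

def akaike_weights(loglik, n_params):
    """Return (AIC values, Akaike weights) following Eqns. (3) and (4)."""
    aic = [2 * (-lk + nu) for lk, nu in zip(loglik, n_params)]
    best = min(aic)
    rel = [math.exp(-(a - best) / 2) for a in aic]   # relative likelihoods
    total = sum(rel)
    return aic, [r / total for r in rel]

models = ["M1", "M2", "M3", "M4", "M5", "M6"]
n_params = [1, 1, 1, 1, 2, 3]
h_pylori = [-2552, -1776, -2052, -1595, -1527, -1529]   # Table 1

aic, w = akaike_weights(h_pylori, n_params)
for name, a, weight in zip(models, aic, w):
    print(name, a, round(weight, 4))
```

The log-normal (M5) has the smallest AIC and receives a weight of about 0.95, the stretched exponential (M6) about 0.05, and all other models receive negligible weight; the ratio of two weights measures the relative support for the corresponding models.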
Model                 M1       M2       M3       M4       M5       M6
Nr. of parameters      1        1        1        1        2        3
D. melanogaster   -38273   -20224   -29965   -18520   -17835   -17820
C. elegans         -9017    -5975    -6071    -4267    -4328    -4248
S. cerevisiae     -24978   -14042   -20342   -13459   -12713   -12759
H. pylori          -2552    -1776    -2052    -1595    -1527    -1529
E. coli             -834     -884     -698     -799     -659     -701

Table 1: Log-likelihoods for the six degree distributions discussed in the text for
PIN data collected from five model organisms. The likelihoods of the models
with the highest Akaike weights (which is always $w_j \approx 1$) are indicated in bold.



3 Results 

In Table 1 we show the likelihoods for the degree distributions calculated from
PIN data collected in five model organisms [12] (the protein interaction data
were taken from the DIP database; http://dip.doe-mbi.ucla.edu). We find that
the standard scale-free model never provides the best fit to the data; in three
networks (S. cerevisiae, H. pylori and E. coli) the lognormal distribution (M5)
explains the data best. In the remaining two organisms (D. melanogaster and
C. elegans) the stretched scale-free model (M6) provides the best fit to the data.
The bold likelihoods correspond to the highest Akaike weights. Apart from the
case of H. pylori (where $\max(w_j) = w_5 \approx 0.95$ for M5 and $w_6 \approx 0.05$) the value
of the maximum Akaike weight is always > 0.9999.

For the yeast PIN the best-fit curves (obtained from the MLEs of the parameters
of models M1-M6) are shown in figure 1, together with the real data.
Visually, log-normal (green) and stretched exponential (blue) appear to describe
the data almost equally well. Closer inspection, guided by the Akaike weights,
however, shows that the fit of the lognormal to the data is in fact markedly
better than the fit of the stretched exponential. The failure of quickly decaying
distributions such as the Poisson distribution, characteristic of classical
random graphs [8], to capture the behaviour of the PIN degree distribution is
obvious.

Figure 2 shows only the three curves with the highest values of $w_j$, which
apart from E. coli are the lognormal, stretched exponential and power-law
distributions; for E. coli, however, the Gamma distribution replaces the power-law
distribution. These figures show that, apart from C. elegans, the shape of the
whole degree distribution is not power-law like, or scale-free like, in a strict
sense. Again we find that lognormal and stretched exponential distributions are
hard to distinguish based on visual assessment alone. Figures 1 and 2, together
with the results of table 1, reinforce the well-known point that it is hard to
choose the best-fitting function based on visual inspection. It is perhaps worth
noting that the PIN data are more complete for S. cerevisiae and D. melanogaster
than for the other organisms.

The standard scale-free model is superior to the lognormal only for C. elegans. 

Figure 1: Yeast (S. cerevisiae) protein interaction data (o) and best-fit probability
distributions: Poisson, exponential, Gamma, power-law, log-normal and
stretched exponential. The parameters of the distributions
shown in this figure are the maximum likelihood estimates based on the real
observed data.

The order of models (measured by decreasing Akaike weights) is M6, M5, M4,
M2, M3, M1 for D. melanogaster, M6, M4, M5, M2, M3, M1 for C. elegans, M5,
M6, M4, M2, M3, M1 for S. cerevisiae and H. pylori, and M5, M3, M6, M4, M2,
M1 for E. coli. Thus in the light of present data the PIN degree distribution of
E. coli lends more support to a Gamma distribution than to a scale-free (or even
stretched scale-free) model. There is, of course, no mechanistic reason why the
Gamma distribution should be biologically plausible, but this point demonstrates
that present PIN data are more complicated than predicted by simple models.
Therefore statistical model selection is needed to determine the extent to which
simple models really provide insights into the intricate architecture of PINs.
While none of the models considered here is likely to be close to the
true model, it is apparent from the present study that there is as yet no simple
probability model that could explain all PIN degree distributions satisfactorily,
including the flexible stretched exponential [13].

We have also determined the likelihoods of the data for two heuristic finite-size
versions of the power-law [2, ?]: their performance is markedly improved
compared to M4 but still behind that of the log-normal and stretched exponential
(data not shown). Moreover, we simulated 1000 scale-free networks with
400 nodes (using the mathematically well-defined LCD construction) and found
that the AIC always favoured the scale-free model over the lognormal distribution;
measured by the AIC, however, stretched exponential and power-law
have comparable power at explaining the data, with a slight advantage for the
more flexible model M6 (the average Akaike weight for the lognormal model is
smaller than $10^{-10}$). Thus the power of model selection extends even to relatively
small networks and it is therefore unlikely that the above observations
can be explained solely by finite-size effects.
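
A simulation of this kind can be sketched as follows. The snippet assumes the simplest variant of the LCD (linearized chord diagram) preferential attachment construction, in which every new node contributes exactly one edge and self-loops are allowed; the number of edges per node used in the simulations reported above is not stated, so this choice, like the function name `lcd_degrees` and the seed, is an assumption made for illustration.

```python
import random
from collections import Counter

def lcd_degrees(n, seed=None):
    """Degree of each node in an LCD-type preferential attachment graph
    in which every new node adds one edge (self-loops allowed)."""
    rng = random.Random(seed)
    endpoints = []                       # both endpoints of every edge so far
    for t in range(n):
        endpoints.append(t)              # half-edge leaving the new node t
        # A uniform draw from the endpoint list picks an existing node with
        # probability proportional to its degree, or node t itself (self-loop).
        target = rng.choice(endpoints)
        endpoints.append(target)
    return Counter(endpoints)

deg = lcd_degrees(400, seed=1)
n_k = Counter(deg.values())              # n(k) for the simulated network
print(sorted(n_k.items()))
```

Repeating this for many seeds and fitting each of the models M1-M6 to the resulting degree sequences would give a distribution of Akaike weights for networks of known scale-free origin.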

For completeness we note that model selection based on BIC results in the 
same ordering of models as the AIC shown here. 

4 Discussion 

Several approaches have been developed which aim to describe networks more 
completely than is possible by the degree distribution (see e.g. [14-16]) and 
this will continue to be an important area of research for some time. The aim
of the present study, however, was to determine if the conclusions which have
been drawn from analysis of degree distributions are statistically sound. This 
is relevant as the degree distribution is still seen as the litmus test of whether a 
network is scale-free; there certainly is no evidence of that for PINs. 

There still remains the question of how much we can expect real networks
to conform to simple models. Intriguingly it has often been admitted that the
degree distribution exhibits power-law behaviour only over small ranges of $k$,
perhaps because of finite-size effects. Quite generally, and as pointed out by
Jensen [17], power-law behaviour should extend over several decades in order to
be meaningfully interpreted (and inferred). In our analysis of finite-size networks
we found, however, that the approach used here is able to distinguish reliably
between the fat-tailed distributions and will select the scale-free model in the vast
majority of cases.

Here we investigated only a small number of degree distributions and we 
do not claim that they are anywhere similar enough to the true distribution 
to be meaningful representations of the underlying network. We do find, however,
that fat-tailed degree distributions, like power-laws and the lognormal and
stretched exponential distributions, offer quantitatively better descriptions of
the data than distributions which decay rapidly. However, the popular scale-free
models often do a disappointing job at capturing the quantitative (to some
extent even the qualitative) characteristics of protein interaction networks
compared to lognormal and stretched exponential distributions.

It has to be kept in mind that model selection differs considerably from 
hypothesis tests [10]: it determines which model, from a given set of models, 
can best explain the data. Formal model selection does not, however, test if 
data were drawn from a particular probability distribution. Crucially, if the true
model is not included in the trial set $\{M_j\}$ then model selection will nevertheless
determine a rank ordering of the models $M_j$ even though they may be very
different from the true model.

Figure 2: Degree distributions of the protein interaction networks (o) of
C. elegans, D. melanogaster, E. coli and H. pylori. The power-law, lognormal
and stretched exponential models are shown for all figures; for E. coli
the Gamma distribution, which performs better (measured by the Akaike
weights) than either the scale-free or the stretched exponential distribution,
is shown.

There are a number of reasons which may contribute to the failure of theoretical
models to describe real degree distributions: (i) PIN data are notoriously noisy
and plagued by false-positive and false-negative results. Curated databases like
DIP, however, try to keep the number of false-positive interactions low. (ii) Models
are often formulated for, or solved in, the thermodynamic limit (network
size $\to \infty$) while real networks are finite. Finite-size effects can, however,
often be straightforwardly incorporated into the formalism (either heuristically
or through explicit numerical modelling). (iii) The evolutionary process
is much more complicated, contingent and erratic than the models from
statistical physics: yeast, for example, underwent a whole genome duplication
some $10^{8}$ years ago [18]. Such events no doubt affect PIN organization but in a
way that is difficult to model statistically. (iv) Most networks studied to date
do not represent the whole network but are in fact smaller subnets sampled
from the whole network. Depending on the sampling process the subnet can
differ radically from the overall network. Finally, (v) biological networks may
not be best described by static graphs. No doubt interactions between proteins
depend on external stimuli and are conditional on the presence or absence of
other chemical entities.
Technological advances, as well as refined statistical tools, will over time
address the first point. The second point is mainly technical/computational. 
The remaining three issues, however, are of considerable interest to theoreti- 
cal biologists and statistical physicists. Statistical network ensembles that are 
more flexible than the simple scale-free models [19] need to be investigated more 
systematically. But even if a mechanistic model is not correct in detail, a cor- 
responding statistical ensemble may nevertheless offer important insights [20], 
and the Akaike formalism will help to keep model complexity at an acceptable, 
though necessary, level (our results indicate there is no danger of over-fitting 
present PIN data as yet). 

In summary, we have shown how statistical methods can be applied to net- 
work data. These methods suggest that simple, if elegant, models from statisti- 
cal physics need to be refined in order to gain quantitative insights into network 
evolution. We believe that the statistical models employed here will also be 
useful in helping to identify more realistic ensembles of network models. 